Many people enjoy video games today, but what is the market really like from a publisher’s perspective? How do publishers think about genres, and does that affect what types of games they publish? Let’s find out.
First, let’s gather our tools.
library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
library(knitr)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
We’re using VGChartz.com and looking at their weekly sales data for 2017. First we need to make sure our URLs are in order. VGChartz uses serial numbers to represent the different weeks.
raw_url <- "https://www.vgchartz.com/weekly/"
dayNums17 <- seq(42743, 43100, by = 7)
head(dayNums17)
## [1] 42743 42750 42757 42764 42771 42778
cat("\n")
fixRaw <- function(x){
  inProgress_url <- paste0(raw_url, x, "/Global/")
  inProgress_url
}
url17 <- fixRaw(dayNums17)
head(url17)
## [1] "https://www.vgchartz.com/weekly/42743/Global/"
## [2] "https://www.vgchartz.com/weekly/42750/Global/"
## [3] "https://www.vgchartz.com/weekly/42757/Global/"
## [4] "https://www.vgchartz.com/weekly/42764/Global/"
## [5] "https://www.vgchartz.com/weekly/42771/Global/"
## [6] "https://www.vgchartz.com/weekly/42778/Global/"
Some parts of their web pages were difficult to isolate, but because they follow a predictable pattern we can just produce that information on our own. Here we create our own date sequence and test it out on some sample URLs.
respectvDts17 <- seq(ymd("2017-01-07"), ymd("2017-12-30"), by = "weeks")
head(respectvDts17)
## [1] "2017-01-07" "2017-01-14" "2017-01-21" "2017-01-28" "2017-02-04"
## [6] "2017-02-11"
cat("\n")
rankSeq <- seq(1, 75, 1)
head(rankSeq)
## [1] 1 2 3 4 5 6
testURL <- c("https://www.vgchartz.com/weekly/42743/Global/", "https://www.vgchartz.com/weekly/42750/Global/", "https://www.vgchartz.com/weekly/42757/Global/")
dateIndex <- function(x) {
  y <- (strtoi(substr(x, 33, 37)) - 42743) / 7 + 1
  y
}
dateIndex(testURL)
## [1] 1 2 3
Consecutive dates and three proper indices extrapolated from our sample URLs, great.
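The fixed `substr(x, 33, 37)` positions are a little brittle: if the domain or path ever changes length, the index silently breaks. A regex-based variant (a sketch; `dateIndexRegex` is a hypothetical name, not used elsewhere in this post) pulls out the five-digit serial wherever it sits in the URL:

```r
library(stringr)

# Sketch: position-independent alternative to dateIndex() that
# extracts the 5-digit week serial with a regex instead of substr().
dateIndexRegex <- function(x) {
  serial <- as.integer(str_extract(x, "\\d{5}"))
  (serial - 42743) / 7 + 1
}

dateIndexRegex("https://www.vgchartz.com/weekly/42757/Global/")  # 3
```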
Here is the function we will use to scrape our data from the VGChartz website. This data covers the entire globe for the whole year of 2017. We will also bind our created columns with what we gather along the way.
VGChartzScrape <- function(url){
  page <- read_html(url)  # fetch once, then pull each column from the parsed page
  genre <- page %>%
    html_nodes('.chart table td+ td') %>% html_text() %>% as.data.frame()
  publshr <- page %>%
    html_nodes('br+ a') %>% html_text() %>% as.data.frame()
  wkSales <- page %>%
    html_nodes('#chart_body .chart td:nth-child(3)') %>% html_text() %>% as.data.frame()
  totSales <- page %>%
    html_nodes('#chart_body td:nth-child(4)') %>% html_text() %>% as.data.frame()
  rank <- as.data.frame(rankSeq)
  thisDay <- respectvDts17[dateIndex(url)]
  dates <- as.data.frame(thisDay)
  # combine and label the columns
  chartX <- cbind(rank, genre, publshr, wkSales, totSales, dates)
  names(chartX) <-
    c("Rank", "Genre", "Publisher", "Weekly_Sales", "Total_Sales", "Date")
  return(chartX)
}
Here we actually do the scraping. Note that one of the pages had broken data, so we will have to omit it from our set. At 75 entries per page and 51 weeks’ worth of data, our last row should match our calculation.
load17i <- map_df(url17[1:36], VGChartzScrape)
load17ii <- map_df(url17[38:52], VGChartzScrape)
rawGlobalChart17 <- rbind(load17i, load17ii)
slice_sample(rawGlobalChart17, n=9)
tail(rawGlobalChart17)
75 * 51
## [1] 3825
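Since one week’s page was broken, a defensive wrapper can save us from re-running the map by hand next time. This is a sketch (`safely_scrape` is a hypothetical helper, not used above): it pauses between requests to be polite and turns a failed scrape into `NULL`, which `map_df()` silently drops. purrr’s `possibly()` captures the same idea.

```r
# Sketch: wrap any scraper so a broken page yields NULL instead of an
# error, with a pause between requests to be gentle on the server.
safely_scrape <- function(scraper, url, pause = 1) {
  Sys.sleep(pause)
  tryCatch(scraper(url), error = function(e) NULL)
}

safely_scrape(function(u) stop("broken page"), "some-url", pause = 0)  # NULL
safely_scrape(function(u) "scraped!", "some-url", pause = 0)           # "scraped!"
```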
The first table gives us a look at a random sample of our data, while the second shows the last few rows. Notice that our last rows do not match our expected calculation of 3825; in fact the count is exactly double what we wanted. Also note that the titles and the genres are mixed together in the same column. This was somewhat intentional, as the two pieces were difficult to separate during scraping.
Here we write a function to cut out all the duplicate data.
# keep the first `many` rows of every `of`-row block
cutN <- function(what, many, of){
  lim <- nrow(what)
  i <- 1 + of
  acc <- slice(what, 1:many)
  while(i <= lim - of + 1){
    temp <- slice(what, i:(i + many - 1))
    acc <- rbind(acc, temp)
    i <- i + of
  }
  return(acc)
}
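For reference, the same “keep the first `many` of every `of` rows” rule can be expressed without a loop, using modular arithmetic on the row positions (a sketch; `cutN_vec` is a hypothetical name, not used in this post):

```r
# Sketch: vectorized equivalent of cutN() - keep a row when its
# zero-based position within each `of`-row block is below `many`.
cutN_vec <- function(what, many, of) {
  keep <- ((seq_len(nrow(what)) - 1) %% of) < many
  what[keep, , drop = FALSE]
}

toy <- data.frame(x = 1:300)
nrow(cutN_vec(toy, 75, 150))  # 150 rows: originals 1-75 and 151-225 survive
```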
rawGlobalChart17a <- cutN(rawGlobalChart17, 75, 150)
slice_sample(rawGlobalChart17a, n=20)
tail(rawGlobalChart17a)
75 * 51
## [1] 3825
Much better, 3825! Also note that in the first table some rows are missing sales figures. That’s because VGChartz only gives sales data for the first 30 rows. But we are going to use the full list to compare genre types and publishers later.
Here we strip out the title and genre information.
genre <- str_extract(rawGlobalChart17a$Genre, ", .+")
genre <- gsub(", ", "", genre)
sample(genre, 9)
## [1] "Racing" "Shooter" "Role-Playing" "Action" "Action"
## [6] "Sports" "Role-Playing" "Shooter" "Adventure"
cat("\n")
Title <- str_extract(rawGlobalChart17a$Genre, ".+ \\(")
Title <- gsub(" \\(", "", Title)
sample(Title, 9)
## [1] "Horizon: Zero Dawn"
## [2] "Pokemon Sun/Moon"
## [3] "Fire Emblem Echoes: Shadows of Valentia"
## [4] "NieR Automata"
## [5] "Mario Party: The Top 100"
## [6] "Animal Crossing: New Leaf"
## [7] "LEGO City Undercover"
## [8] "NBA 2K18"
## [9] "UnchartedPS4)"
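For reference, the title, platform, and genre can also be captured in one pass with `str_match()`, assuming the combined column follows a “Title (Platform), Genre” pattern (the string below is illustrative; malformed rows like the “UnchartedPS4)” one above would still need special handling):

```r
library(stringr)

# Sketch: split "Title (Platform), Genre" in one pass with capture groups.
raw <- "Horizon: Zero Dawn (PS4), Action"  # illustrative example string
parts <- str_match(raw, "^(.*) \\((.*)\\), (.*)$")
parts[, 2]  # "Horizon: Zero Dawn"
parts[, 3]  # "PS4"
parts[, 4]  # "Action"
```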
And now we bind it back together into a new table.
rawGlobalChart17b <- cbind(rawGlobalChart17a, Title)
rawGlobalChart17b <- rawGlobalChart17b %>% mutate(Genre = genre)
slice_sample(rawGlobalChart17b, n=9)
We are going to separate our data into two tables: one with full titles, publishers, and genres, and another with sales information.
fullGlobalCh17 <- rawGlobalChart17b %>%
select(Genre, Publisher, Title, Date)
slice_sample(fullGlobalCh17, n=9)
Looking good so far. Now for that separate table with the full sales info. We also fix the number formatting here.
rawGlobalChart17c <- cutN(rawGlobalChart17b, 30, 75)
rawGlobalChart17c <- rawGlobalChart17c %>%
  mutate(Weekly_Sales = gsub(",", "", Weekly_Sales),
         Weekly_Sales = as.numeric(Weekly_Sales),
         Total_Sales = gsub(",", "", Total_Sales),
         Total_Sales = as.numeric(Total_Sales))
globalChart17 <- as_tibble(rawGlobalChart17c)
tail(globalChart17)
nrow(globalChart17)
## [1] 1530
30 * 51
## [1] 1530
Great, we cut out the part we wanted successfully.
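As an aside, readr (already loaded via the tidyverse) offers `parse_number()`, which handles the comma-stripping step above in a single call; a sketch:

```r
library(readr)

# Sketch: parse_number() drops grouping marks like commas before
# converting, replacing the gsub() + as.numeric() pair used above.
parse_number("1,234,567")  # 1234567
```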
Here we look at the top genres for 2017. It looks like shooters are leading sales, with Action, Sports, and Role-Playing in the second tier.
globalChart17 %>%
  group_by(Genre) %>%
  summarise(Total = sum(Weekly_Sales)) %>%
  arrange(desc(Total)) %>%
  top_n(11, Total) %>%
  ggplot() +
  geom_col(aes(x = reorder(Genre, Total), y = Total), fill = "aquamarine3") +
  coord_flip() +
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))
Here we see the top publishers for 2017. Nintendo is doing really well, with EA and Activision trying to catch up.
globalChart17 %>%
  group_by(Publisher) %>%
  summarise(Total = sum(Weekly_Sales)) %>%
  arrange(desc(Total)) %>%
  top_n(11, Total) %>%
  ggplot() +
  geom_col(aes(x = reorder(Publisher, Total), y = Total), fill = "aquamarine3") +
  coord_flip() +
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))
So from what we have so far, does that mean Nintendo sells a bunch of shooters? Doesn’t Activision hold Call of Duty, so why is it in third place with a top title in a top category? Let’s look at publishers versus genres and see what we find.
slice_sample(fullGlobalCh17, n=9)
rawAnalyze17 <- count(fullGlobalCh17, Publisher, Genre, sort = TRUE)
head(rawAnalyze17)
Well, Activision is doing well with shooters, but not that well: they couldn’t have released that many shooters in 2017 alone. The same titles must show up in the chart week after week, so we have duplicates to filter out.
Here we isolate for unique entries only.
rawAnalyze17 <- unique(fullGlobalCh17[, 1:3])
rawAnalyze17a <- count(rawAnalyze17, Publisher, Genre, sort = TRUE)
head(rawAnalyze17a, 30)
Much better. Now we see that Nintendo does have a lot of games, but mostly in the Role-Playing section. Role-Playing is doing well, but it is certainly not helping Square Enix, who was in 7th place above. And sure enough, Activision does publish a lot of shooters.
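The `unique()`-then-`count()` step can also be written entirely in dplyr with `distinct()`, which reads a little more clearly; a sketch on toy data (the data frame below is illustrative, not our scraped table):

```r
library(dplyr)

# Sketch: distinct() + count() as a dplyr equivalent of
# unique(fullGlobalCh17[, 1:3]) followed by count().
toy <- data.frame(Publisher = c("Activision", "Activision", "Nintendo"),
                  Genre     = c("Shooter", "Shooter", "Role-Playing"),
                  Title     = c("CoD: WWII", "CoD: WWII", "Pokemon Sun/Moon"))
toy %>%
  distinct(Publisher, Genre, Title) %>%
  count(Publisher, Genre, sort = TRUE)
```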
Let’s see if we can tie title, genre, and publisher together in a way that makes sense of what the data is saying.
p <- globalChart17 %>%
  group_by(Title, Genre, Publisher) %>%
  summarise(Total = sum(Weekly_Sales)) %>%
  arrange(desc(Total))
## `summarise()` has grouped output by 'Title', 'Genre'. You can override using
## the `.groups` argument.
plot_ly(p, y = ~Genre, x = ~Total, color = ~Publisher)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#bar
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
This may help, but there are too many entries to see what is happening. (Recommend widening the code window for better graph width.)
Let’s limit what we look at to a little less than 25% of the data, ordered by top sales.
p1 <- head(p, 45)
plot_ly(p1, y = ~Genre, x = ~Total, color = ~Publisher)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#bar
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Now this is more useful. We can double-click an entry in the legend on the right to isolate the publishers we want to look at. Already we see that Activision dominates the shooter sphere, with EA and Nintendo also present. EA is really about sports games, and that seems to be where they make most of their money. So why was Nintendo in such a lead over the other companies? When we click on them, we see that they do well in most categories; they have a diversified approach. It may seem that Nintendo can also lean on the fact that they are a platform manufacturer, but when we look at Sony (who oddly competes with itself) and Microsoft, we see that is not really much of an advantage.
So overall, each company has its own strategy for the market. Activision focuses on top-selling shooters, and EA does something similar with sports. But Nintendo doesn’t take that approach at all, and they did the best of anyone in 2017; interesting.